List of AI News about SWE Bench
| Time | Details |
|---|---|
| 2026-04-09 18:28 | **Claude Sonnet Plus Opus Advisor Boosts SWE-bench Multilingual by 2.7 Points at 11.9% Lower Cost — Latest Evaluation Analysis.** According to @claudeai on Twitter, Sonnet paired with an Opus advisor scored 2.7 percentage points higher on SWE-bench Multilingual than Sonnet alone while reducing per-task cost by 11.9%. As reported in the post, this advisor-enhanced workflow delivers measurable quality gains alongside cost efficiency on multilingual software engineering benchmarks. For AI product teams, the data suggests a practical orchestration strategy: route primary reasoning to Sonnet and invoke Opus selectively for guidance, improving pass rates while lowering run-time spend. According to the tweet, these results come from evals on SWE-bench Multilingual, highlighting a repeatable method for cost-aware performance optimization in LLM-based coding assistants. |
| 2026-02-27 12:10 | **MiniMax M2.5 Beats Opus 4.6 on SWE-bench Verified: 80.2% Score, 3x Faster, $1/Hour — AI Coding Benchmark Analysis.** According to God of Prompt on X (Twitter), MiniMax M2.5 surpassed Opus 4.6 on the SWE-bench Verified benchmark with an 80.2% score, runs roughly 3x faster, and is offered at a flat $1 per hour while using only 10B activated parameters, positioning it as the smallest Tier-1 model for coding tasks. As reported by the same source, these metrics imply lower latency and significantly reduced inference cost, enabling 24/7 autonomous coding agents and continuous integration bots at practical budgets. According to the post, the combination of high benchmark accuracy and a small active parameter count suggests strong efficiency per dollar, which can improve ROI for software teams deploying code assistants, test-repair bots, and maintenance agents in production pipelines. |
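The Sonnet-primary / Opus-advisor routing strategy from the first item can be sketched as a simple escalation loop. This is a minimal illustration of the orchestration pattern only; the tweet does not describe the actual implementation, and the model-calling functions, confidence field, and threshold below are all hypothetical stand-ins.

```python
# Hypothetical sketch of the Sonnet-primary / Opus-advisor routing pattern.
# The model interfaces here are assumptions, not the actual Claude API.

def solve_with_advisor(task, sonnet, opus_advisor, confidence_threshold=0.7):
    """Route primary reasoning to the cheaper model; consult the
    stronger model only when the first attempt looks uncertain."""
    attempt = sonnet(task)  # primary pass: cheaper, faster model
    if attempt["confidence"] >= confidence_threshold:
        return attempt["patch"]  # confident enough, no advisor cost incurred
    # Selective escalation: ask the advisor for guidance, then retry once.
    guidance = opus_advisor(task, attempt["patch"])
    retry = sonnet(f"{task}\nAdvisor guidance: {guidance}")
    return retry["patch"]

# Stub "models" that only illustrate the control flow.
def fake_sonnet(prompt):
    confident = "Advisor guidance" in prompt or "easy" in prompt
    return {"patch": "fix.diff", "confidence": 0.9 if confident else 0.4}

def fake_opus(task, draft_patch):
    return "consider editing the failing test's fixture"

print(solve_with_advisor("easy bug", fake_sonnet, fake_opus))  # no escalation
print(solve_with_advisor("hard bug", fake_sonnet, fake_opus))  # escalates once
```

Because the advisor is invoked only below the confidence threshold, most tasks pay only the cheaper model's cost, which is consistent with the reported lower per-task spend relative to always using the larger model.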
According to God of Prompt on X (Twitter), MiniMax M2.5 surpassed Opus 4.6 on the SWE-Bench Verified benchmark with an 80.2% score, delivers roughly 3x faster execution, and is offered at a flat $1 per hour, while using only 10B activated parameters, positioning it as the smallest Tier-1 model for coding tasks. As reported by the same source, these metrics imply lower latency and significantly reduced inference cost, enabling 24/7 autonomous coding agents and continuous integration bots at practical budgets. According to the post, the combination of high benchmark accuracy and small active parameter count suggests strong efficiency-per-dollar, which can improve ROI for software teams deploying code assistants, test repair bots, and maintenance agents in production pipelines. |